Abstract


We consider the problem of off-policy policy 
selection in reinforcement learning: using his- 
torical data generated from running one pol- 
icy to compare two or more policies. We show 
that approaches based on importance sampling 
can be unfair—they can select the worse of 
two policies more often than not. We give two 
examples where the unfairness of importance 
sampling could be practically concerning. We 
then present sufficient conditions to theoreti- 
cally guarantee fairness and a related notion of 
safety. Finally, we provide a practical impor- 
tance sampling-based estimator to help miti- 
gate one of the systematic sources of unfair- 
ness resulting from using importance sampling 
for policy selection. 


1 INTRODUCTION 


In this paper, we consider the problem of off-policy pol- 
icy selection: using historical data generated from run- 
ning one policy to compare two or more policies. Off- 
policy policy selection methods can be used, for exam- 
ple, to decide which policy should be deployed when two 
or more batch reinforcement learning (RL) algorithms 
suggest different policies or when a data-driven policy is 
compared to a policy designed by a human expert. The 
primary contribution of this paper is that we show that
the importance sampling (IS) estimator (Precup et al., 2000),
which lies at the foundation of many policy selection and
policy search algorithms (Mandel et al., 2014; ..., 2013;
Thomas et al., 2015b), is often unfair when used for policy
selection: when comparing two policies, the worse of the two
policies may be returned more than half the time.


After formalizing our notion of fairness, we show that 
unfairness can occur in both the on-policy and off-policy
settings. We further show that in the off-policy set- 
ting, using importance sampling for policy selection can 
be unfair in practically relevant settings. In particular, 
we show that IS can favor myopic policies—policies 
that obtain less total reward, but which obtain it early 
in an episode—as well as favoring policies that pro- 
duce shorter trajectories. Although IS is an unbiased 
estimator, this unfairness arises because policy selec- 
tion involves taking a maximum over estimated quanti- 
ties. Depending on the distribution of estimates, impor- 
tance sampling may systematically favor policies that are 
worse in expectation. 


We then present two new approaches for avoiding un- 
fairness when using importance sampling for policy se- 
lection. First, we give sufficient conditions under which 
using importance sampling for policy selection is fair, 
and provide algorithms that guarantee a related notion of 
safety. We also describe how our approach to guarantee- 
ing fairness and safety is related to the notions of power 
analysis and statistical hypothesis testing. Although of 
theoretical interest, the reliance of this approach on con- 
servative concentration inequalities limits its practicality. 
Thus, in our second approach, we introduce a new prac- 
tical IS-based estimator that lacks the theoretical prop- 
erties of our first approach, but which can help mitigate 
unfairness due to differing trajectory lengths. 


Although there is significant literature surrounding re- 
ducing the variance and mean squared error of off-policy 
policy evaluation methods that use IS-based estimators
(... Li, 2016; Thomas and Brunskill, 2017), other challenges
associated with off-policy policy selection, such as fairness,
have not been explored in the literature. Similar notions of
fairness have recently been proposed in the online RL setting,
where an algorithm is fair if it never takes a worse action
with higher probability than a better action (Jabbari et al.,
2017). This is similar to our notion of fairness in the offline
RL setting, where an algorithm is fair if it does not choose a
worse policy with
higher probability than the best candidate policy. How- 
ever, the issues surrounding fairness in the two settings 
are different due to the different nature of each setting. 
By introducing a notion of fairness for policy selection 
and highlighting some limitations of existing IS-based 
approaches, we hope to motivate further work on devel- 
oping practical and fair policy selection algorithms. 


2 BACKGROUND 


We consider sequential decision making settings in stochastic
domains. In such domains, an agent interacts with the
environment, and in doing so, it generates a trajectory,
$\tau \triangleq (O_0, A_1, R_1, O_1, A_2, R_2, \ldots, A_T, R_T, O_T)$,
which is a sequence of observations, actions, and rewards,
with trajectory length $T$. The observations and rewards are
generated by the environment according to an unknown
stochastic process, such as a Markov decision process (MDP)
or partially observable Markov decision process (POMDP).
The agent chooses actions according to a stochastic policy
$\pi$, which is a conditional probability distribution over
actions $A_t$ given the partial trajectory
$\tau_{1:t-1} \triangleq (O_0, A_1, R_1, O_1, A_2, R_2, \ldots, O_{t-1})$
of prior observations, actions, and rewards. The value of a
policy $\pi$, $V^{\pi}$, is the expected sum of rewards when
the policy is used:
$V^{\pi} = \mathbb{E}\left[\sum_{t=1}^{T} R_t \,\middle|\, \tau \sim \pi\right]$,
where $\tau \sim \pi$ means that the actions of $\tau$ are
sampled according to $\pi$. The agent's goal is to find and
execute a policy with a large value.


In this paper, we consider offline (batch) reinforcement
learning (RL) where we have a batch of data, called
historical data, that was generated from some known behavior
policy $\pi_b$. We are interested in the problem of batch
policy selection: identifying a good policy for use in the
future. This typically involves policy evaluation, or
estimating the value of a policy $\pi_e$ using the historical
data that was generated from the behavior policy $\pi_b$. If
$\pi_e = \pi_b$ this is known as on-policy policy evaluation.
Otherwise it is known as off-policy policy evaluation.


2.1 ON-POLICY POLICY EVALUATION 


Although we are primarily interested in the off-policy
setting, i.e., the setting where $\pi_e \neq \pi_b$, we will
also discuss the problem of on-policy policy evaluation. This
problem arises, for example, when running a randomized
controlled trial or A/B test to compare two policies. In this
case, the value of each policy is directly estimated by
running it to generate $n$ trajectories,
$\tau_1, \tau_2, \ldots, \tau_n$, and then estimating the
policy's performance using the Monte Carlo estimator:
$\hat V_{MC}^{\pi} = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{T_i} R_{i,t}$,
where $R_{i,t}$ and $T_i$ denote the reward of $\tau_i$ at
time $t$ and the length of $\tau_i$ respectively.
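The Monte Carlo estimator is simply the average return over the sampled trajectories. A minimal Python sketch, assuming each trajectory is represented only by its list of rewards (the function name and data layout are illustrative, not from the paper):

```python
from typing import Sequence


def monte_carlo_estimate(trajectory_rewards: Sequence[Sequence[float]]) -> float:
    """Average, over trajectories, of the sum of rewards in each trajectory."""
    n = len(trajectory_rewards)
    if n == 0:
        raise ValueError("need at least one trajectory")
    # Inner sum: return of one trajectory; outer sum divided by n: the mean.
    return sum(sum(rewards) for rewards in trajectory_rewards) / n
```

Trajectories may have different lengths, which matches the per-trajectory upper limit $T_i$ in the inner sum.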


2.2 IMPORTANCE SAMPLING 


In this paper, we primarily focus on off-policy policy
evaluation and selection. Specifically, we focus on
estimators that use importance sampling. Model-based
off-policy estimators tend to have lower variance than
IS-based estimators, but at the cost of being biased and
asymptotically incorrect (not consistent estimators of
$V^{\pi_e}$) (Mandel et al., 2014). In contrast, IS-based
estimators can provide unbiased estimates of the value. There
has been significant interest in using IS-based techniques in
RL for policy evaluation (Precup et al., 2000; ..., 2016), as
well as growing recent interest in using it for policy
selection (Mandel et al., 2014) and the related problems of
policy search and policy gradient (...; Wang et al., ...).


The IS estimator (Precup et al., 2000) is given by:
$\hat V_{IS}^{\pi_e} = \frac{1}{n} \sum_{i=1}^{n} w_i \sum_{t=1}^{T_i} R_{i,t}$,
where
$w_i = \prod_{t=1}^{T_i} \frac{\pi_e(a_{i,t} \mid \tau_{i,1:t-1})}{\pi_b(a_{i,t} \mid \tau_{i,1:t-1})}$.
The IS estimator is an unbiased and strongly consistent
estimator of $V^{\pi_e}$ if $\pi_e(a \mid \tau_{1:t-1}) = 0$
for all actions, $a$, and partial trajectories,
$\tau_{1:t-1}$, where $\pi_b(a \mid \tau_{1:t-1}) = 0$.
However, $\hat V_{IS}^{\pi_e}$ often has high variance. The
weighted importance sampling (WIS) estimator,
$\hat V_{WIS}^{\pi_e} = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \sum_{t=1}^{T_i} R_{i,t}$,
is a variant of the IS estimator that often has much lower
variance, but which is not an unbiased estimator of
$V^{\pi_e}$.
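The two estimators differ only in their normalizer: IS divides the weighted sum of returns by $n$, while WIS divides it by the sum of the weights. A minimal Python sketch, assuming each trajectory is stored as a triple of per-step rewards, per-step action probabilities under $\pi_e$, and per-step action probabilities under $\pi_b$ (this layout and the function names are illustrative assumptions):

```python
import math
from typing import Sequence, Tuple

# (rewards, action probabilities under pi_e, action probabilities under pi_b)
Trajectory = Tuple[Sequence[float], Sequence[float], Sequence[float]]


def importance_weight(eval_probs: Sequence[float],
                      behavior_probs: Sequence[float]) -> float:
    """Product over time steps of pi_e(a_t | history) / pi_b(a_t | history)."""
    return math.prod(pe / pb for pe, pb in zip(eval_probs, behavior_probs))


def is_and_wis(trajectories: Sequence[Trajectory]) -> Tuple[float, float]:
    """Return (IS estimate, WIS estimate) for one evaluation policy."""
    weights = [importance_weight(pe, pb) for _, pe, pb in trajectories]
    returns = [sum(rewards) for rewards, _, _ in trajectories]
    weighted_sum = sum(w * g for w, g in zip(weights, returns))
    v_is = weighted_sum / len(trajectories)       # normalize by n
    total_weight = sum(weights)
    # WIS is undefined when every weight is zero; report NaN in that case.
    v_wis = weighted_sum / total_weight if total_weight > 0 else float("nan")
    return v_is, v_wis
```

The behavior probabilities must be non-zero for every logged action, which holds automatically for data actually generated by $\pi_b$.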


3 FAIR POLICY SELECTION


A policy selection algorithm is given a set of candidate 
policies and must choose one of the policies for use in the 
future. Any policy evaluation algorithm (i.e., estimator) 
can be converted to a policy selection algorithm by sim- 
ply evaluating each policy using the estimator, and then 
selecting the policy that has the largest estimated value. 
Thus we can use the Monte Carlo estimator for on-policy 
policy selection, and we can use the IS or WIS estimators 
for off-policy policy selection. 


There are two natural properties we would like in a batch 
policy selection algorithm: 


• Consistency: In the limit as the amount of historical
data goes to infinity, the algorithm should always
select the policy that has the largest value.


• Fairness: With any amount of historical data, the
probability that the algorithm selects a policy with
the largest value should be greater than the probability
that it selects a policy that does not have the largest
value. When choosing between two policies, this implies
that the algorithm should choose the better policy at
least half the time.¹


Exploring and ensuring the fairness of (IS-based) policy
selection algorithms is the focus of this paper. There has
been recent interest in combining model-based estimators and
IS-based estimators (Thomas and Brunskill, 2016); however, as
both model-based estimators and (as we will show) IS-based
estimators are unfair, it is easy to show that these new
estimators must also be unfair. Therefore, we restrict our
attention to the standard IS and WIS estimators.


Before proceeding, we formally define a way to rank
policies. Let the better-than operator $\succ_B$ be such that
$\pi_1 \succ_B \pi_2$ is True if $\pi_1$ is better than
$\pi_2$ and False otherwise, for some notion of "better". For
example, we can define $\succ_V$ to order policies based on
their values: $\pi_1 \succ_V \pi_2$ is True if
$V^{\pi_1} > V^{\pi_2}$ and False otherwise. Define the
optimal policy $\pi^*$ in a policy class $\Pi$ to be the
policy where $\pi^* \succ_B \pi'$ for all
$\pi' \neq \pi^* \in \Pi$.² We will now formally define
fairness.

Definition 3.1. A policy selection algorithm that chooses
policies from a policy class $\Pi$ is fair with respect to a
better-than operator $\succ_B$ if whenever the algorithm
outputs a policy, the probability that it outputs $\pi^*$ is
at least as large as the probability that it will output any
other policy. The algorithm is strictly fair if the
probability of outputting policy $\pi^*$ is strictly greater
than the probability of outputting any other policy.


Notice that the probabilistic guarantee in this definition 
conditions on when the algorithm outputs a policy. This 
allows for a policy selection algorithm that does not out- 
put any policy in cases when it cannot determine which 
policy is better. Also, notice that the trivial policy selec- 
tion algorithm that never outputs a policy is fair. How- 
ever, ideally we want a policy selection algorithm that 
outputs a policy as often as possible while maintaining 
fairness. This is an important distinction: although we 
want an algorithm that often outputs a policy, we re- 
quire the algorithm to at least be fair. We now see that 
this seemingly straightforward property is not satisfied 
by even the most natural policy selection algorithms. 


¹For simplicity, hereafter we assume that there are no two
candidate policies that are equally good.

²We assume such a best policy exists; however, for some
reasonable better-than operators, this may not be true, as we
explore in Section 5.1.


Table 1: The probability of each action under $\pi_1$ and
$\pi_2$ for the example domain where Monte Carlo estimation is
unfair. Rewards, $R$, are deterministic.

            $a_1$ ($R = 0$)   $a_2$ ($R = r$)   $a_3$ ($R = 1$)
$\pi_1$     0                 1                 0
$\pi_2$     $1 - p$           0                 $p$


We begin by showing that even Monte Carlo estimation is
unfair when used for on-policy policy selection. Suppose we
want to select the better of two policies, $\pi_1$ and
$\pi_2$, in a multi-armed bandit (MAB) domain with three
actions $a_1$, $a_2$, and $a_3$ with rewards and probabilities
as described in Table 1. Notice that $V^{\pi_1} = r$ and
$V^{\pi_2} = p$. So, if $r < p$, then $V^{\pi_1} < V^{\pi_2}$.
However, notice that using one trajectory ($n = 1$), the Monte
Carlo estimator is unfair with respect to $\succ_V$ if
$r < p < 0.5$ since
$\Pr(\hat V_{MC}^{\pi_1} < \hat V_{MC}^{\pi_2}) = p$. We can
similarly show that Monte Carlo policy selection is unfair
using $n$ trajectories for $n > 1$, as long as $r$ and $p$ are
sufficiently small.
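The claim for general $n$ in the bandit of Table 1 can be checked exactly, because the estimate for $\pi_1$ is always $r$ and the estimate for $\pi_2$ is a scaled binomial count. A minimal Python sketch, assuming $0 < r < 1$ and treating a tie as not selecting $\pi_2$ (the function name and the tie rule are illustrative assumptions):

```python
from math import comb


def prob_mc_selects_pi2(n: int, p: float, r: float) -> float:
    """Pr(MC estimate of pi_2 > MC estimate of pi_1) with n trajectories each.

    pi_1 always earns r, so its estimate is exactly r. pi_2 earns 1 with
    probability p and 0 otherwise, so its estimate is K / n, K ~ Binomial(n, p).
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n + 1) if k / n > r)


# pi_2 is the better policy whenever r < p, yet it can lose most of the time.
single_sample = prob_mc_selects_pi2(n=1, p=0.4, r=0.3)    # equals p
five_samples = prob_mc_selects_pi2(n=5, p=0.1, r=0.05)    # 1 - 0.9**5, below 0.5
```

When $r < 1/n$, the probability reduces to $1 - (1 - p)^n$, so selection is unfair whenever $(1 - p)^n > 0.5$.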


4 UNFAIRNESS OF IMPORTANCE 
SAMPLING POLICY SELECTION 


Unsurprisingly, importance sampling is also unfair with 
respect to $\succ_V$. However, the unfairness of importance
sampling can be arbitrarily worse than the unfairness of 
the Monte Carlo estimator, in that for any n, we can con- 
struct a domain such that IS policy selection is unfair 
even though the Monte Carlo estimator will always pick 
the correct policy with even a single sample! We provide 
one such example in Supplementary Material A. In this
section, we present two examples that highlight how the 
unfairness of importance sampling can arise in counter- 
intuitive ways in practically interesting settings, motivat- 
ing the importance of caring about satisfying fairness. 


4.1 IS FAVORS MYOPIC POLICIES 


In the following example we show that even when comparing
two policies that are equally close to the behavior policy,
importance sampling can still be unfair. In particular, we
show that using IS for policy selection could be biased in
favor of myopic policies, which could be of significant
practical concern. This may come up in practical settings
where we are interested in comparing more heuristic methods
of planning (e.g., short look-ahead) to full-horizon planning
methods. If we have the correct model class, full-horizon
planning is expected to be optimal; however, it is both
computationally expensive (so possibly not even tractable)
and potentially sub-optimal if our model class is incorrect
(e.g., our state representation is inaccurate or the world is
a POMDP but we are modeling it as an MDP). Thus, we may be
interested in comparing full-horizon planning (or an
approximation thereof) to myopic planning, and the following
example shows that IS can sometimes favor policies resulting
from myopic planning even when full-horizon planning is
optimal.


Figure 1: Domain in Section 4.1. The agent is in a chain of
length 10. In each state, the agent can either go right
($a_R$), which progresses the agent along the chain and gives
a reward of 0 unless the agent is in $s_{10}$, in which case
it gives a reward of 10 (and keeps the agent in $s_{10}$), or
go left ($a_L$), which takes the agent back to state $s_1$
and gives a reward of 1.


Figure 2: Domain in Section 4.2. The agent is placed
uniformly at random in either a chain MDP of length 2 or a
chain of length $L$. At each time step, action $a_X$
deterministically gives a reward of 1 to the agent if the
agent is in the chain of length 2 and 0 otherwise, and action
$a_Y$ deterministically gives a reward of 1 to the agent if
the agent is in the chain of length $L$ and 0 otherwise. Both
actions progress the agent along the chain.


Consider the MDP given in Figure 1. Now suppose we have data
collected from a behavior policy $\pi_b$ that takes each
action with probability 0.5 and all trajectories have length
200. We want to compare two policies: $\pi_{myopic}$, which
takes $a_L$ with probability 0.99 and $a_R$ with probability
0.01, and $\pi_{opt}$, which takes $a_L$ with probability 0.01
and $a_R$ with probability 0.99. (Note: the actual optimal
policy is to always take $a_R$, of which $\pi_{opt}$ is a
slightly stochastic version.) Notice that the probability
distribution of importance weights is the same for both
$\pi_{myopic}$ and $\pi_{opt}$, so both are equally close to
the behavior policy in terms of probability distributions
over trajectories. However, for datasets that are not large
enough, the importance sampling estimate will often be larger
for $\pi_{myopic}$ than for $\pi_{opt}$, even though it is
clearly the worse policy. For example, when we have 1000
samples, (1) around 60% of the time, the importance sampling
estimate of $\pi_{myopic}$ is larger than that of
$\pi_{opt}$, and (2) around 95% of the time, the weighted
importance sampling estimate of $\pi_{myopic}$ is larger than
that of $\pi_{opt}$. Thus both the IS and WIS estimators are
unfair for policy selection.
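The comparison above can be explored with a small simulation. The following Python sketch is not the authors' code: it assumes the agent starts in $s_1$, uses NumPy, and introduces its own helper names, so its output need not match the percentages quoted above exactly.

```python
import numpy as np

HORIZON, CHAIN_LEN = 200, 10


def sample_behavior_data(n, rng):
    """Sample n trajectories from the uniform behavior policy.

    Returns each trajectory's return and the number of times a_R was taken.
    """
    went_right = rng.random((n, HORIZON)) < 0.5
    state = np.zeros(n, dtype=int)           # index 0 is s_1, index 9 is s_10
    returns = np.zeros(n)
    for t in range(HORIZON):
        right = went_right[:, t]
        at_end = state == CHAIN_LEN - 1
        # a_R pays 10 only in s_10 and 0 elsewhere; a_L always pays 1.
        returns += np.where(right, np.where(at_end, 10.0, 0.0), 1.0)
        # a_R moves one step right (staying in s_10); a_L resets to s_1.
        state = np.where(right, np.minimum(state + 1, CHAIN_LEN - 1), 0)
    return returns, went_right.sum(axis=1)


def importance_weights(n_right, p_right):
    """Weights for a policy that takes a_R with probability p_right each step."""
    log_w = (n_right * np.log(p_right / 0.5)
             + (HORIZON - n_right) * np.log((1 - p_right) / 0.5))
    return np.exp(log_w)


def myopic_preferred_rate(n_datasets, n, seed=0):
    """Fraction of datasets where IS (resp. WIS) ranks pi_myopic above pi_opt."""
    rng = np.random.default_rng(seed)
    is_wins = wis_wins = 0
    for _ in range(n_datasets):
        returns, n_right = sample_behavior_data(n, rng)
        w_opt = importance_weights(n_right, 0.99)
        w_myo = importance_weights(n_right, 0.01)
        is_wins += np.mean(w_myo * returns) > np.mean(w_opt * returns)
        wis_wins += (np.sum(w_myo * returns) / np.sum(w_myo)
                     > np.sum(w_opt * returns) / np.sum(w_opt))
    return is_wins / n_datasets, wis_wins / n_datasets
```

Calling `myopic_preferred_rate(100, 1000)` repeats the 1000-sample comparison on 100 independent datasets. The weights depend only on how many times $a_R$ was taken, which is why the two policies have identically distributed importance weights.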


IS is unfair in this case because one policy only gives high
rewards in events that are unlikely under the behavior
policy, and hence the behavior policy often does not see the
high rewards of this policy as compared to a myopic policy.
However, note that these events are
still likely enough that we can build a model that would 


suggest choosing the optimal policy. IS is unable to de- 
tect simple patterns that a model-based approach (or even 
a human briefly looking at the data) would easily infer; 
this is the cost of having an evaluation technique that 
places virtually no assumptions on policies. 


4.2 IS FAVORS SHORTER TRAJECTORIES 


Importance sampling can also systematically favor poli- 
cies that assign higher probability to shorter trajectory 
lengths in domains where the length of each trajectory 
may vary. This is a problem that could arise in many 
practical domains, for example domains where a user is 
free to leave the system at any time, such as a student 
solving problems in an educational game or a user chat- 
ting with a dialogue system. In such systems, a bad pol- 
icy may cause a user to leave the system sooner resulting 
in a short trajectory, which makes it particularly prob- 
lematic that importance sampling can favor policies that 
assign higher probability to shorter trajectories. The fol- 
lowing example shows that importance sampling can fa- 
vor policies that generate shorter trajectories even when 
they are clearly worse. 


Consider the domain given in Figure 2. Now suppose we have
data collected from a behavior policy $\pi_b$ that takes each
action with probability 0.5. We want to compare two policies:
$\pi_X$, which takes action $a_X$ with probability 0.99, and
$\pi_Y$, which takes action $a_Y$ with probability 0.99.
Consider the case where $L = 80$. Clearly $\pi_Y$ is the
better policy, because it earns a lot of reward when we
encounter trajectories of length 80, while only losing out on
a small reward when encountering the short trajectories.
Table 2 shows the median estimate, out of 100 simulations, of
the Monte Carlo estimator, as well as the median IS and WIS
estimates using 1000 samples each. We find that while $\pi_Y$
is, in actuality, much better, IS essentially only weights the
shorter trajectories, so the estimates only reflect how well
the policies do on those trajectories. WIS simply (almost)
doubles the estimates because half of the samples have
extremely low importance weights. So why does this occur?
When using IS in settings where trajectories can have varying
lengths, the importance weight of shorter trajectories can be
much larger than for longer trajectories, because for longer
trajectories we are multiplying more ratios of probabilities
that are more often smaller than one. This happens even if
the policy we are evaluating is more likely to produce longer
trajectories than shorter ones (because there are
exponentially many longer trajectories and so each individual
trajectory has an exponentially smaller weight than an
individual short trajectory).


Table 2: Median estimates, out of 100 simulations, of
different estimators using 100 samples of $\pi_X$ and $\pi_Y$
in the domain in Section 4.2.

            $\hat V_{MC}$   $\hat V_{IS}$   $\hat V_{WIS}$
$\pi_X$     1.39            0.98            1.98
$\pi_Y$     39.52           0.010           0.020
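The arithmetic behind this explanation can be sketched directly. The Python snippet below assumes that a chain of length 2 yields two time steps and a chain of length $L$ yields $L$ time steps (an assumption, though it reproduces the Monte Carlo medians in Table 2), and compares the importance weight of a short trajectory with that of a typical long one; the function names are illustrative.

```python
SHORT_LEN, LONG_LEN = 2, 80   # L = 80; one action per chain position (assumed)


def true_value(p_x: float) -> float:
    """Expected return of a policy that takes a_X with probability p_x."""
    short_chain = SHORT_LEN * p_x           # a_X pays 1 only in the short chain
    long_chain = LONG_LEN * (1 - p_x)       # a_Y pays 1 only in the long chain
    return 0.5 * short_chain + 0.5 * long_chain


def weight(num_x: int, num_y: int, p_x: float) -> float:
    """Importance weight of one trajectory under the uniform behavior policy."""
    return (p_x / 0.5) ** num_x * ((1 - p_x) / 0.5) ** num_y


v_x, v_y = true_value(0.99), true_value(0.01)   # about 1.39 and 39.61

# Weights for pi_Y: the best short trajectory versus a typical long trajectory
# from the behavior policy (about half a_X, half a_Y).
w_short = weight(0, 2, 0.01)            # 1.98**2, about 3.92
w_long_typical = weight(40, 40, 0.01)   # (1.98 * 0.02)**40, vanishingly small
```

Because the typical long trajectory carries a weight many orders of magnitude below that of a short one, the IS estimate for $\pi_Y$ is dominated by the short chain, where $\pi_Y$ performs poorly.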


5 A NEW KIND OF FAIRNESS 


The examples above illustrate that even the most straight- 
forward way of evaluating policies could misguide some- 
one about which policy is actually better under the ob- 
jective of maximizing the expected sum of reward. Al- 
though we showed that off-policy policy selection using 
IS can be much less fair than doing on-policy Monte 
Carlo estimation, the fact that on-policy estimation is 
also unfair suggests that perhaps we are using the wrong 
measure of performance to construct our better-than
operators. The problem is that, even if $\pi_1$ usually
produces trajectories with more reward than those produced by
$\pi_2$, it could still be that $V^{\pi_1} < V^{\pi_2}$ if
there is a rare trajectory with very large reward that is
more likely under $\pi_2$. It is therefore worth considering
a different notion of "better-than" that better captures
which policy is likely to perform better given a fixed and
finite amount of historical data. This can be captured using
the following better-than operator, which we refer to as
better-than with respect to Monte Carlo estimation:
$\pi_1 \succ_{MC,n} \pi_2$ is True if
$\Pr(\hat V_{MC,n}^{\pi_1} > \hat V_{MC,n}^{\pi_2}) \ge \Pr(\hat V_{MC,n}^{\pi_1} < \hat V_{MC,n}^{\pi_2})$,
and False otherwise.
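As a quick check of how $\succ_{MC,n}$ can disagree with $\succ_V$, consider the bandit of Table 1 with $n = 1$ and $0 < r < p < 0.5$ (that $r$ lies strictly between 0 and 1 is an assumption read off the table). A single sample from $\pi_1$ always returns $r$, while a single sample from $\pi_2$ returns 0 or 1:

```latex
\begin{align*}
\Pr\big(\hat V^{\pi_1}_{MC,1} > \hat V^{\pi_2}_{MC,1}\big)
  &= \Pr(\pi_2 \text{ draws } a_1) = 1 - p, \\
\Pr\big(\hat V^{\pi_1}_{MC,1} < \hat V^{\pi_2}_{MC,1}\big)
  &= \Pr(\pi_2 \text{ draws } a_3) = p, \\
1 - p > p \;&\Longrightarrow\; \pi_1 \succ_{MC,1} \pi_2,
  \quad \text{although } V^{\pi_1} = r < p = V^{\pi_2},
  \text{ i.e., } \pi_2 \succ_V \pi_1 .
\end{align*}
```

So the same pair of policies is ordered one way by expected value and the other way by single-sample Monte Carlo comparison.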


Notice that this notion of better-than is also exactly what 


we would want to use if we want to find a policy where 
we optimize our chance of beating a baseline in an ex- 
periment or A/B test with n samples (i.e., obtaining a 
statistically significant result). Also, notice that we have 
a different better-than operator for each n, which is not 
necessarily related to the number of samples that we have 
from our behavior policy $\pi_b$. (For example, we may have
collected data from interactions with 1,000 users, but want
to deploy a policy and expect one million users to use it; in
that case we may want to be fair with respect to
$\succ_{MC,10^6}$.) We will use $\succ_{MC}$ to refer to this
notion of better-than in general (even though it is not
technically a better-than operator without specifying the
number of samples). Of course, for this new better-than
operator, the Monte Carlo estimator is fair by definition. It
is interesting to note that if the distributions over the sum
of rewards under the two policies are symmetric distributions
(e.g., normal distributions), then the two better-than
operators are equivalent. It may seem as though it is easier
to satisfy fairness for importance sampling with respect to
$\succ_{MC,n}$ than with respect to $\succ_V$; however, we
will show below that, at least in some sense, this is not the
case. We can nevertheless present conditions where both are
satisfied simultaneously. We will henceforth consider and
present results with respect to both notions of fairness
(i.e., fairness with respect to $\succ_V$ and fairness with
respect to $\succ_{MC,n}$). But we will first consider some
counterintuitive properties of this new better-than operator.


5.1 FAIRNESS AND NON-TRANSITIVITY 


It is a well-known result in probability theory that there
exist random variables $X$, $Y$, and $Z$ such that
$\Pr(X > Y) > 0.5$ and $\Pr(Y > Z) > 0.5$, but
$\Pr(Z > X) > 0.5$ as well, indicating that comparing random
variables in such a way is non-transitive (Trybuła, 1961;
Gardner, 1970). We claim that this non-transitivity holds for
the ordering induced by the $\succ_{MC,n}$ operator as well.
This is shown in Supplementary Material B. In fact, it is
possible that for any policy $\pi$, there is another policy
$\pi'$ where $\pi' \succ_{MC,n} \pi$ is True, meaning a best
policy might not exist with respect to $\succ_{MC}$.
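A concrete instance of such a cycle is easy to verify by enumeration. The Python sketch below uses a standard set of non-transitive dice as hypothetical single-trajectory return distributions for three policies; it is an illustration for $n = 1$, not the construction from the supplementary material.

```python
from fractions import Fraction
from itertools import product


def prob_greater(x, y) -> Fraction:
    """Exact Pr(X > Y) for independent X, Y uniform over the listed outcomes."""
    wins = sum(a > b for a, b in product(x, y))
    return Fraction(wins, len(x) * len(y))


# Hypothetical return distributions (each outcome equally likely).
returns_a = [2, 4, 9]
returns_b = [1, 6, 8]
returns_c = [3, 5, 7]

a_beats_b = prob_greater(returns_a, returns_b)
b_beats_c = prob_greater(returns_b, returns_c)
c_beats_a = prob_greater(returns_c, returns_a)
```

Each of the three probabilities equals 5/9, so with one trajectory per policy the first policy is better than the second, the second better than the third, and the third better than the first under $\succ_{MC,1}$.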


The good news is that over fifty years ago it was shown
(1965) that for any $m$ independent random variables
$X_1, \ldots, X_m$,
$\min\{\Pr(X_1 > X_2), \ldots, \Pr(X_{m-1} > X_m), \Pr(X_m > X_1)\} < 0.75$.
Thus, in order to avoid such non-transitivity, we motivate
a new kind of fairness:


Definition 5.1. An algorithm is transitively fair with
respect to a better-than operator $\succ_B$ if it is fair
with respect to $\succ_B$ and if the algorithm does not output
a policy when comparing two policies from a set of policies
$\pi_1, \ldots, \pi_k$ such that $\pi_i \succ_B \pi_{i+1}$ for
$i \in \{1, \ldots, k-1\}$ and $\pi_k \succ_B \pi_1$.


Clearly any algorithm that is fair with respect to $\succ_V$
is also transitively fair, as comparing real numbers is a
transitive operation. This is not the case for every
algorithm that is fair with respect to $\succ_{MC,n}$, but
one way to ensure transitive fairness in this case is to not
only be fair with respect to $\succ_{MC,n}$, but to also not
output a policy unless
$\Pr(\hat V_{MC,n}^{\pi_1} > \hat V_{MC,n}^{\pi_2}) \ge 0.75$.


6 GUARANTEEING FAIRNESS


Given that importance sampling is not fair in general, we
would like to understand under what conditions we can
guarantee importance sampling can be used to do fair policy
selection. Recall that even Monte Carlo estimation is not
fair, so we would also like to give conditions for which we
can guarantee on-policy policy selection can be done fairly.
Thus, we are interested in guaranteeing the following four
notions of fairness: (1) fairness with respect to $\succ_V$
when we have samples from each policy, (2) fairness with
respect to $\succ_{MC,n}$ when we have samples from each
policy, (3) fairness with respect to $\succ_V$ when we have
samples from a behavior policy, and (4) fairness with respect
to $\succ_{MC,n}$ when we have samples from a behavior policy.


Notice that case (2) is satisfied whenever we use Monte
Carlo estimation for policy selection by definition of
$\succ_{MC}$. We now give theorems describing the conditions
under which we can satisfy the remaining three cases. Let
$V_{max}^{\pi}$ be the largest value that could result from
policy $\pi$ and $w_{max}^{\pi}$ be the largest importance
weight possible for policy $\pi$ with samples drawn from
behavior policy $\pi_b$. In what follows, for simplicity, we
assume that the minimum possible value of a trajectory for
all policies is 0.³ Our results can be extended to the case
where this is not true by considering the minimum possible
value for each policy. Theorem 6.1 gives the conditions under
which using the on-policy Monte Carlo estimator is fair for
policy selection, when comparing two policies.

Theorem 6.1. Using the on-policy Monte Carlo estimator for
policy selection when we have $n$ samples from each of
policies $\pi_1$ and $\pi_2$ is fair provided that
$V_{max}^{\pi_1} + V_{max}^{\pi_2} \le |V^{\pi_1} - V^{\pi_2}| \sqrt{\frac{2n}{\ln 2}}$.
We can guarantee strict fairness if the inequality above is
strict.


³In the off-policy case, all our results also hold under the
very mild assumption that for each policy we evaluate there
is some trajectory that has non-zero probability under $\pi_b$
but has 0 probability under the evaluation policy.


Similarly, Theorem 6.2 gives conditions for which using the
importance sampling estimator for policy selection is
guaranteed to be fair with respect to $\succ_V$. Algorithm 1
is a fair policy selection algorithm that guarantees fairness
whenever the condition in Theorem 6.2 is met, and otherwise
returns No Fair Comparison, provided that
$\epsilon \le |V^{\pi_1} - V^{\pi_2}|$ and $\delta \ge 0.5$.
Setting $\delta = 0.5$ is sufficient to guarantee fairness,
but we can guarantee stronger notions of fairness by choosing
some $\delta > 0.5$ (i.e., whenever the algorithm outputs a
policy, it outputs the better policy with probability at
least $\delta > 0.5$). While we only consider comparing two
policies in this section, we can extend our results to the
case where we select a policy from a class of $n > 2$
policies, as we show in Supplementary Material D.

Theorem 6.2. Using importance sampling for policy selection
when we have $n$ samples from the behavior policy is fair
with respect to $\succ_V$, provided that
$w_{max}^{\pi_1} V_{max}^{\pi_1} + w_{max}^{\pi_2} V_{max}^{\pi_2} \le |V^{\pi_1} - V^{\pi_2}| \sqrt{\frac{2n}{\ln 2}}$.
We can guarantee strict fairness if the inequality above
is strict.


Theorems 6.1 and 6.2 can both be shown with a simple
application of Hoeffding's inequality; the proofs are given
in Supplementary Material C. Alternatively, we can use other
concentration inequalities to obtain fairness
conditions/algorithms of a similar form. Notice that Theorem
6.2 tells us that as long as neither policy is too far from
the behavior policy in terms of the largest possible
importance weight, then we can guarantee fairness, which
intuitively makes sense; we can only fairly compare policies
that are similar to the behavior policy. However, how far we
can stray will also depend on how different the values of the
policies are from each other. This is a quantity we do not
know, so we must pick an $\epsilon$ where either we think
$\epsilon \le |V^{\pi_1} - V^{\pi_2}|$ or we are comfortable
with the possibility of selecting a policy whose value is
$\epsilon$ worse than that of the better policy. Thus
$\epsilon$ can be thought of as a hypothetical effect size as
would be encountered in hypothesis testing. To make the
analogy with hypothesis testing clearer, notice that if we
fix the policies that we want to compare, we can instead
convert Theorems 6.1 and 6.2 to give lower bounds on $n$ that
guarantee fairness; that is, we can ask how many samples we
need before we can fairly compare two specific policies. This
is analogous to doing a power analysis in the hypothesis
testing literature. A critical difference is that in
hypothesis testing we are typically interested in minimizing
the probability of a bad event, whereas here we are ensuring
that the better of the two policies is chosen more often.
Furthermore, in the off-policy case, we are testing
counterfactual hypotheses, i.e., hypotheses that we never run
in the real world.
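Solving the inequality in Theorems 6.1 and 6.2 for $n$ gives the power-analysis style bound described above. A minimal Python sketch, assuming the Hoeffding-style threshold $\ln(1/(1-\delta))$ for $\delta > 0.5$ (which reduces to the theorems' $\ln 2$ at $\delta = 0.5$); the function and parameter names are illustrative:

```python
import math


def min_samples_for_fairness(range_bound: float, effect_size: float,
                             delta: float = 0.5) -> int:
    """Smallest n with range_bound <= effect_size * sqrt(2n / ln(1/(1-delta))).

    range_bound is V_max1 + V_max2 in the on-policy case (Theorem 6.1) and
    w_max1 * V_max1 + w_max2 * V_max2 in the off-policy case (Theorem 6.2).
    """
    if not 0.5 <= delta < 1:
        raise ValueError("delta must lie in [0.5, 1)")
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    return math.ceil(range_bound ** 2 * math.log(1 / (1 - delta))
                     / (2 * effect_size ** 2))


# Two policies with returns in [0, 1] and a hypothesized effect size of 0.1.
n_on_policy = min_samples_for_fairness(1 + 1, 0.1)
# Same policies evaluated off-policy with importance weights of at most 10.
n_off_policy = min_samples_for_fairness(10 * 1 + 10 * 1, 0.1)
```

The required sample size grows with the square of the largest importance weight, which is why the off-policy requirement here is one hundred times the on-policy one.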


Notice that Theorem 6.2 is satisfied for a much smaller
subset of policies than Theorem 6.1, as $w_{max}^{\pi_1}$ and
$w_{max}^{\pi_2}$ can be huge (exponential in the trajectory
length). This is not surprising and matches our intuition (as
seen in the examples above) that IS can be very unfair even
in cases where on-policy selection is fair.


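Algorithm 1 Off-Policy FPS-$\succ_V$
Require: $\pi_1, \pi_2, V_{max}^{\pi_1}, V_{max}^{\pi_2}, \epsilon, \delta$
  $\tau_1, \tau_2, \ldots, \tau_n \sim \pi_b$
  if $w_{max}^{\pi_1} V_{max}^{\pi_1} + w_{max}^{\pi_2} V_{max}^{\pi_2} \le \epsilon \sqrt{\frac{2n}{\ln(1/(1-\delta))}}$ then
    return $\arg\max_{\pi \in \{\pi_1, \pi_2\}} \hat V_{IS}^{\pi}$
  else
    return No Fair Comparison
  end if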
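Algorithm 1 can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: it assumes the per-trajectory IS terms $w_i \sum_t R_{i,t}$ have already been computed for each policy, uses the threshold $\ln(1/(1-\delta))$ (equal to $\ln 2$ at $\delta = 0.5$), and breaks ties in favor of the first policy.

```python
import math
from typing import Optional, Sequence


def off_policy_fps(is_terms_1: Sequence[float], is_terms_2: Sequence[float],
                   w_max_1: float, v_max_1: float,
                   w_max_2: float, v_max_2: float,
                   epsilon: float, delta: float = 0.5) -> Optional[int]:
    """Return 1 or 2 for the selected policy, or None for No Fair Comparison."""
    n = len(is_terms_1)
    if n == 0 or n != len(is_terms_2):
        raise ValueError("need the same positive number of IS terms per policy")
    threshold = epsilon * math.sqrt(2 * n / math.log(1 / (1 - delta)))
    if w_max_1 * v_max_1 + w_max_2 * v_max_2 <= threshold:
        v_is_1 = sum(is_terms_1) / n     # IS estimate for pi_1
        v_is_2 = sum(is_terms_2) / n     # IS estimate for pi_2
        return 1 if v_is_1 >= v_is_2 else 2
    return None
```

Note that the fairness check depends only on the weight and value bounds, $\epsilon$, $\delta$, and $n$; the data are consulted only after the check passes.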


In order to satisfy fairness with respect to $\succ_{MC,n}$,
the condition on the difference between the two policies
will be in terms of the difference between typical Monte
Carlo estimates of the value, as shown in Theorem 6.3.
We can also satisfy transitive fairness with a stricter
assumption on the difference between the two policies.


Theorem 6.3. For any two policies $\pi_1$ and $\pi_2$,
behavior policy $\pi_b$, and for all $k \in \{1, 2, 3, \ldots\}$,
using importance sampling for policy selection when we have
$n$ samples from the behavior policy is fair with respect to
$\succ_{MC,kn}$ provided that there exists $\epsilon > 0$ and
$\delta < 0.5$ such that
$\Pr(\hat V_{MC,kn}^{\pi_1} - \hat V_{MC,kn}^{\pi_2} \ge \epsilon) \ge 1 - \delta$
and
$(w_{max}^{\pi_1} + 1) V_{max}^{\pi_1} + (w_{max}^{\pi_2} + 1) V_{max}^{\pi_2} \le \epsilon \sqrt{\frac{2n}{\ln(\ldots)}}$.
Importance sampling in this setting is transitively fair,
provided that $\delta < 0.25$.


The theorem essentially says that as long as the conditions
hold, we can guarantee fairness with respect to
$\succ_{MC,m}$ for any $m$ that is a multiple of $n$ (which
is slightly weaker than being able to satisfy fairness for
all $m > n$). Notice that if we compare Theorem 6.2 with
Theorem 6.3 for the same choice of $\epsilon$ and for any
choice of $\delta < 0.5$ (in Theorem 6.3), then provided that
the conditions on the difference between the two policies for
both theorems hold, the latter is satisfied for a strict
subset of policies. Thus, we can use Theorem 6.3 to
simultaneously guarantee fairness with respect to $\succ_V$
and with respect to $\succ_{MC,m}$ (provided that the policy
effect size conditions of both theorems hold). It might seem
strange that satisfying $\succ_{MC,n}$ requires a stricter
test than satisfying $\succ_V$, as it might seem as though
importance sampling would be more likely to choose the same
policy as the Monte Carlo estimator than the policy that has
a higher value (when they are not the same). However, this is
not necessarily the case, as the policy that importance
sampling chooses will depend on the behavior policy.


6.1 SAFETY 


As mentioned above, our conditions on fairness require us to have a reasonable estimate of an effect size between the policies we want to compare. It would be worthwhile to have a guarantee that does not require us to speculate about this unknown quantity. In this section, we give another type of guarantee, which we refer to as safety. Safety has been considered as a property for policy evaluation and policy improvement algorithms (Thomas et al., 2015a,b). Here we extend the property to apply to policy selection algorithms.


Definition 6.1. A policy selection algorithm that chooses policies from a policy class $\Pi$ is safe with probability $1 - \delta$ with respect to a better-than operator $\succ$ if whenever it outputs any policy $\pi$ such that there exists another policy $\pi' \in \Pi$ where $\pi' \succ \pi$, it does so with probability at most $\delta$ for some $\delta < 0.5$.


Thus, a safe policy selection algorithm does not output a sub-optimal policy often; however, it is still possible for a safe policy selection algorithm to output a sub-optimal policy more often than the best policy, but in that case, the algorithm will not output any policy often. This definition is weaker than fairness, but as we will see, we can satisfy it without requiring knowledge about an effect size between the policies. Theorem 6.4 gives the conditions for a safe policy selection algorithm with respect to $\succ_V$, and Theorem 6.5 gives analogous conditions for a safe policy selection algorithm with respect to $\succ_{\mathrm{MC}}$, both using Algorithm 2 as the underlying algorithm. The proofs that these algorithms guarantee safety are given in Supplementary Material E.


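Algorithm 2 Off-Policy SPS
input $\pi_1$, $\pi_2$, $\omega$, $p$
$\tau_1, \tau_2, \ldots, \tau_n \sim \pi_b$
$\beta \leftarrow \omega \sqrt{\ln(2/(1-p)) / (2n)}$
if $\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} - \beta > 0$ then
    return $\pi_1$
else if $\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} + \beta < 0$ then
    return $\pi_2$
else
    return No Fair Comparison
end if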
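The following is a minimal Python sketch of the test in Algorithm 2, assuming the confidence width is $\beta = \omega\sqrt{\ln(2/(1-p))/(2n)}$ and that the caller has already computed, for each trajectory, its total return and its importance weight under each candidate policy. The function and argument names are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence

NO_FAIR_COMPARISON = "No Fair Comparison"


def is_estimate(weights: Sequence[float], returns: Sequence[float]) -> float:
    """Ordinary importance sampling estimate: mean of weight * return."""
    return sum(w * g for w, g in zip(weights, returns)) / len(returns)


def off_policy_sps(w1: Sequence[float], w2: Sequence[float],
                   returns: Sequence[float], omega: float, p: float) -> str:
    """Return 'pi_1', 'pi_2', or NO_FAIR_COMPARISON.

    w1, w2  : importance weights of each behavior trajectory under pi_1, pi_2
    returns : total reward of each behavior trajectory
    omega   : range term from Theorem 6.4 or Theorem 6.5
    p       : confidence level used to size the interval
    """
    n = len(returns)
    beta = omega * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n))
    diff = is_estimate(w1, returns) - is_estimate(w2, returns)
    if diff - beta > 0:      # lower bound on the difference is positive
        return "pi_1"
    if diff + beta < 0:      # upper bound on the difference is negative
        return "pi_2"
    return NO_FAIR_COMPARISON
```

The third outcome is what distinguishes this test from plain IS-based selection: when the interval around the estimated difference straddles zero, the sketch declines to choose.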


Theorem 6.4. For any two policies $\pi_1$ and $\pi_2$, behavior policy $\pi_b$, $\omega = w^{\pi_1}_{\max} V^{\pi_1}_{\max} + w^{\pi_2}_{\max} V^{\pi_2}_{\max}$, and $\delta < 0.5$, Algorithm 2 is a safe policy selection algorithm with respect to $\succ_V$ with probability $1 - \delta$.


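Theorem 6.5. For any two policies $\pi_1$ and $\pi_2$, where $\Pr\left(\left|\hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn}\right| > 0\right) \ge 1 - \delta_{\mathrm{MC}}$, any behavior policy $\pi_b$, $\omega = (w^{\pi_1}_{\max} + 1) V^{\pi_1}_{\max} + (w^{\pi_2}_{\max} + 1) V^{\pi_2}_{\max}$, $p = 1 - \delta_{\mathrm{MC}}\delta$ for some $\delta < 0.5$, and for all $k \in \{1, 2, 3, \ldots\}$, Algorithm 2 is a safe policy selection algorithm with respect to $\succ_{\mathrm{MC},kn}$ with probability $1 - \delta$ when we have $n$ samples drawn from $\pi_b$. It is a transitively fair policy selection algorithm whenever $\delta_{\mathrm{MC}} < 0.25$.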


Note that these algorithms are analogous to statistical hypothesis testing in that we compare the lower bound of our estimate of the value of one policy with the upper bound of our estimate of the value of another policy. This analogy mirrors how our fair policy selection algorithms share much in common with power analyses for hypothesis testing. Also note that, as with fairness, Theorem 6.5 ensures safety with respect to both better-than operators, so if one uses Algorithm 2 with the inputs as described in Theorem 6.5, one does not have to determine which better-than operator one is using. Again, we find that safety with respect to $\succ_{\mathrm{MC}}$ is more difficult to satisfy than safety with respect to $\succ_V$. We can formalize this with the following theorem.


Theorem 6.6. There exist policies $\pi_1$, $\pi_2$, and behavior policy $\pi_b$ for which Algorithm 2 with inputs as described in Theorem 6.4 is not a safe policy selection algorithm with respect to $\succ_{\mathrm{MC},1}$ with $p = 0.5$ when we have a single sample drawn from $\pi_b$.


7 PRACTICAL FAIRNESS: 
VARYING TRAJECTORY LENGTHS 


While the algorithms above provide a way to guarantee fairness, the concentration inequalities we used are quite loose in most cases, and would likely result in returning No Fair Comparison in many cases. Practically, it would be desirable to have algorithms that can provide fair comparisons more often. As a first step in this direction, here we discuss a heuristic approach to policy selection for domains where we have varying trajectory lengths, as seen in Section 4.2. We focus on this particular aspect of unfairness because it is systematic (potentially arising in any domain where trajectories vary in length), yet it seems like there should be a way to correct for the systematic preference towards shorter trajectories in practically relevant domains. The idea we propose here is to compute an IS-based estimate for each trajectory length individually and then recombine the estimates to get a new estimate. We propose using the following estimator, which we refer to as the Per-Horizon Weighted Importance Sampling (PHWIS) estimator:


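$$\hat{V}_{\mathrm{PHWIS}} = \sum_{l \in \mathcal{L}} W_l \cdot \underbrace{\frac{1}{\sum_{\{i \,:\, |\tau_i| = l\}} w_i} \sum_{\{i \,:\, |\tau_i| = l\}} w_i \sum_{t=1}^{l} R_{i,t}}_{\text{WIS estimate on } l\text{-length trajectories}}$$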


[Figure 3 plot: probability of choosing policy $\pi_Y$ (y-axis) against $L$ (x-axis, 0 to 80), with one curve each for On Policy, PHWIS-Evaluation, PHWIS-Estimated, PHWIS-Behavior, IS, and WIS.]

Figure 3: Probability of various estimators choosing $\pi_Y$ over $\pi_X$ for different values of $L$ in the domain given in Section 4.2 with 1000 trajectories drawn from the uniform random behavior policy. For each estimator, the probability of outputting $\pi_Y$ was estimated using 100 independent estimates.


where $\mathcal{L}$ is the set of trajectory lengths that appear in the data and $W_l$ is a weight for the relative importance of each trajectory length.

Notice that in the domain in Section 4.2, the length of the trajectories did not depend on the policy that was used to generate them; in such cases, we can use the following weights: $W_l(\text{Behavior}) \propto \sum_{i=1}^{n} \mathbb{1}(|\tau_i| = l)$. The weights simply count the proportion of trajectories in our data (i.e., generated by the behavior policy) that have length $l$. We will refer to PHWIS with this weighting scheme as PHWIS-Behavior. Now we will see how this estimator performs on the domain in Section 4.2 (Figure 2), where $L \in \{1, 3, 5, 10, 20, 40, 60, 80\}$, given 1000 trajectories from the uniform random behavior policy. Figure 3 shows that while IS and WIS are unfair (they choose $\pi_X$ more often than $\pi_Y$) when the long trajectories are of length 20, PHWIS-Behavior always chooses the policy that the on-policy Monte Carlo estimator would choose (i.e., $\pi_X$ when $L = 1$, and $\pi_Y$ otherwise).


However, note that in cases where different policies may generate trajectories of different lengths (for example, bad policies causing users to drop out sooner), this simple form of weighting might not work well. Ideally, we would like to use the following weights: $W_l(\text{Evaluation}) \propto \Pr(|\tau| = l \mid \tau \sim \pi_e)$, where $\pi_e$ is the evaluation policy for which we would like to estimate $V^{\pi_e}$. We cannot actually compute these weights because we do not know the probability of any trajectory being generated by the evaluation policy, but when we have ground truth, we can use the PHWIS-Evaluation estimator as a point of comparison. To approximate these weights, we can take the weighted importance sampling estimate of the trajectory lengths generated by the behavior policy: $W_l(\text{WIS}) \propto \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i\, \mathbb{1}(|\tau_i| = l)$ (i.e., rather than reweighting the rewards of all the trajectories generated by the behavior policy, we reweight the probability that trajectories of length $l$ are generated by the behavior policy). While this is a seemingly reasonable thing to do, this estimate will still suffer from high variance for the same reason the WIS estimator does, which would again lead to assigning small weights to longer trajectories. Instead we propose a heuristic way to make weights for different trajectory lengths have comparable magnitudes: $W_l(\text{Estimated}) \propto \sum_{i=1}^{n} w_i^{1/l}\, \mathbb{1}(|\tau_i| = l)$. The idea behind these weights is that they should give us a sense of which trajectory lengths are preferred by the evaluation policy while maintaining that weights of different trajectory lengths have comparable magnitudes. As we see in Figure 3, PHWIS-Estimated has almost identical performance to PHWIS-Behavior and PHWIS-Evaluation (with just a small probability of choosing the wrong policy when both trajectory lengths are short). However, the true value of using such an estimator comes in domains where the length of a trajectory depends on the policy, and so PHWIS-Behavior may not be sufficient. We now examine one such domain.
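The PHWIS estimator and its computable weighting schemes can be sketched as follows. This is an illustrative implementation, not the authors' code: it assumes each trajectory is supplied as a pair (importance weight, list of per-step rewards), that the length weights $W_l$ are normalized to sum to one, and that the Estimated scheme uses $w_i^{1/l}$. The function name `phwis` and the scheme labels are hypothetical.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple

Trajectory = Tuple[float, Sequence[float]]  # (importance weight, rewards)


def phwis(trajectories: List[Trajectory], scheme: str = "behavior") -> float:
    """Per-Horizon WIS: a weighted sum of per-length WIS estimates."""
    by_len = defaultdict(list)
    for w, rewards in trajectories:
        by_len[len(rewards)].append((w, sum(rewards)))

    # Unnormalized length weights W_l for the chosen scheme.
    raw = {}
    for l, group in by_len.items():
        if scheme == "behavior":        # count of trajectories of length l
            raw[l] = float(len(group))
        elif scheme == "wis":           # importance-weighted count
            raw[l] = sum(w for w, _ in group)
        elif scheme == "estimated":     # l-th root keeps magnitudes comparable
            raw[l] = sum(w ** (1.0 / l) for w, _ in group)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    total = sum(raw.values())

    estimate = 0.0
    for l, group in by_len.items():
        w_sum = sum(w for w, _ in group)
        if total == 0.0 or w_sum == 0.0:
            continue  # this length carries no weight under the evaluation policy
        wis_l = sum(w * g for w, g in group) / w_sum  # WIS on length-l data
        estimate += (raw[l] / total) * wis_l
    return estimate
```

PHWIS-Evaluation is omitted because, as noted above, its weights require knowing the length distribution under the evaluation policy.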


7.1 POLICY-DEPENDENT TRAJECTORY 
LENGTHS 


Consider an MDP that has three states ($s_0$, $s_1$, and $s_2$) and two actions ($X$ and $Y$). The agent starts in $s_0$ and, on the first time step, if the agent takes action $X$, it deterministically transitions to $s_1$ and receives a reward of $r$, and if the agent takes action $Y$, it deterministically transitions to $s_2$ and receives a reward of 1. Thereafter the agent will always remain in the same state until the trajectory ends and will receive a reward of $r$ whenever it takes action $X$ in state $s_1$, a reward of 1 whenever it takes action $Y$ in state $s_2$, and a reward of 0 otherwise. If the first action was $X$, the trajectory will end with probability 0.05 after each action, and if the first action was $Y$, the trajectory will end with probability 0.01 after each action. Thus, taking action $Y$ in the beginning will result in a trajectory that is five times as long in expectation.


The behavior policy $\pi_b$ takes each action with probability 0.5. Again, we want to compare two policies: $\pi_X$, which takes action $X$ with probability 0.99, and $\pi_Y$, which takes action $Y$ with probability 0.99. Whether $\pi_X$ or $\pi_Y$ is better will depend on $r$. The policies have the same value when $r = 5$, because the difference in rewards offsets the difference in lengths.
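The break-even point at $r = 5$ can be checked with a short closed-form calculation. The sketch below assumes that the termination check also applies after the first action, so that trajectory length is geometric with mean $1/0.05 = 20$ or $1/0.01 = 100$, and that rewards are undiscounted; the function name is illustrative.

```python
def expected_return(p_x: float, r: float,
                    end_x: float = 0.05, end_y: float = 0.01) -> float:
    """Expected total reward of a policy taking X with probability p_x everywhere."""
    p_y = 1.0 - p_x
    # Expected number of steps after the first one (geometric length, mean 1/end).
    later_steps_x = 1.0 / end_x - 1.0
    later_steps_y = 1.0 / end_y - 1.0
    # First action X: reward r now, then r on each later step where X is taken.
    value_first_x = r + later_steps_x * p_x * r
    # First action Y: reward 1 now, then 1 on each later step where Y is taken.
    value_first_y = 1.0 + later_steps_y * p_y * 1.0
    return p_x * value_first_x + p_y * value_first_y


v_x = expected_return(0.99, r=5.0)  # pi_X
v_y = expected_return(0.01, r=5.0)  # pi_Y
```

Under these assumptions both values come out equal at $r = 5$, with $\pi_Y$ ahead for smaller $r$ and $\pi_X$ ahead for larger $r$.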


Figure 4 shows which policy is chosen when using different estimators for the domain described above with various values for $r$. We find that the IS, WIS, and even PHWIS-Behavior estimators are unfair regardless of the value of $r$ (except for PHWIS-Behavior when $r = 1$). PHWIS-Evaluation tracks the on-policy estimator reasonably well, only sometimes choosing the wrong policy in the $r = 4$ case. PHWIS-Estimated similarly tracks the on-policy estimator reasonably well, but it sometimes chooses the wrong policy in the $r = 3$ case and is unfair when $r = 4$. Thus, PHWIS-Estimated seems to be a reasonable policy selection estimator to use in such domains, and can be much better than using PHWIS-Behavior in some cases.

[Figure 4 plot: probability of choosing policy $\pi_Y$ (y-axis) for each value of $r$.]

Figure 4: Probability of various estimators choosing $\pi_Y$ over $\pi_X$ for different values of $r$ in the domain given in Section 7.1.


8 CONCLUSION 


In this paper, we examined the problem of off-policy pol- 
icy selection and introduced a new property for policy 
selection algorithms called fairness. We showed that im- 
portance sampling is unfair when used for policy selec- 
tion even though it is an unbiased estimator for policy 
evaluation. We presented two approaches to deal with 
this issue, a theoretical solution and a new practical es- 
timator. This is but a first step in tackling the issue of 
fairness in off-policy policy selection. Our hope is that 
our introduction of the notion of fairness for policy selection will result in growing interest in the challenges involved in off-policy policy selection, including how unfairness propagates to policy search methods that optimize over an infinite class of policies.
SUPPLEMENTARY MATERIAL


A UNFAIRNESS OF IMPORTANCE SAMPLING


Suppose we want to use importance sampling to select the better of two policies, $\pi_e$ and $\pi_b$, where we have prior data collected from $\pi_b$, in a MAB with two actions $a_1$ and $a_2$, with rewards and probabilities as described in Table 3. Notice that $V^{\pi_b} = p + (1 - p)r$ and $V^{\pi_e} = 1$, so a fair policy selection algorithm should choose $\pi_e$ at least half the time since $p + (1 - p)r < 1$. If we draw only a single sample from $\pi_b$, then with probability $1 - p$, IS would select $\pi_b$ over $\pi_e$. Thus as long as $p < 0.5$, IS will be unfair. Furthermore, notice that as we decrease $p$, the gap between the performance of the policies increases, yet the probability that IS chooses the right policy only decreases!

Now suppose we draw $n$ samples from $\pi_b$. Notice that as long as $a_2$ is never sampled, IS will choose $\pi_b$, since in that case $\hat{V}^{\pi_e}_{\mathrm{IS}} = 0$. $\pi_b$ will never sample $a_2$ with probability $(1 - p)^n$. Thus IS is unfair as long as $(1 - p)^n > 0.5$, or as long as $p < 1 - 0.5^{1/n} \approx \ln(2)/n$ for large $n$.


It may appear that this unfairness is not a big problem when we have a reasonable number of samples, but the practical significance of this problem becomes more pronounced in more realistic domains where we have a large number of possible trajectories, or equivalently, a long horizon. For example, consider a domain where there are only two actions and the agent must take 50 sequential actions and receives a reward only at the end of a trajectory. Furthermore, suppose that the only valuable trajectory is the one that takes a particular action for the entire trajectory (analogous to $a_2$ above). In this case, we would need over $10^{14}$ samples just to get IS to be fair!
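The two thresholds in this section are easy to check numerically. The sketch below computes the largest $p$ for which IS is still unfair with $n$ samples, and the smallest $n$ at which $(1 - p)^n$ drops to 0.5. For the 50-step example it assumes a behavior policy that picks each of the two actions uniformly at random, so the valuable trajectory has probability $2^{-50}$; that assumption and the function names are illustrative.

```python
import math


def unfair_p_threshold(n: int) -> float:
    """IS is unfair whenever p is below this value: 1 - 0.5 ** (1 / n)."""
    return 1.0 - 0.5 ** (1.0 / n)


def min_samples_for_fair_is(p: float) -> int:
    """Smallest n with (1 - p) ** n <= 0.5, where p is the chance of the key sample."""
    # log1p keeps precision when p is tiny, as in the long-horizon example.
    return math.ceil(math.log(0.5) / math.log1p(-p))


# 50 sequential binary actions under a uniform random behavior policy (assumed).
n_needed = min_samples_for_fair_is(2.0 ** -50)
```

Under the uniform-behavior assumption, `n_needed` is on the order of $7.8 \times 10^{14}$, consistent with the "over $10^{14}$" figure quoted above.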


Table 3: Domain in Supplementary Material A. Rewards are deterministic. The bottom two rows give the probability distributions for $\pi_e$ and $\pi_b$ over the two actions.

            $a_1$            $a_2$
            ($R = r < 1$)    ($R = 1$)
$\pi_e$     0                1
$\pi_b$     $1 - p$          $p$


B NON-TRANSITIVITY 


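Theorem B.1 (Non-Transitivity of $\succ_{\mathrm{MC},n}$). The relation induced by $\succ_{\mathrm{MC},n}$ is non-transitive. Specifically, there exist policies $\pi_1$, $\pi_2$, and $\pi_3$ where

$$\Pr\left(\hat{V}^{\pi_1}_{\mathrm{MC},n} > \hat{V}^{\pi_2}_{\mathrm{MC},n}\right) = \Pr\left(\hat{V}^{\pi_2}_{\mathrm{MC},n} > \hat{V}^{\pi_3}_{\mathrm{MC},n}\right) = \Pr\left(\hat{V}^{\pi_3}_{\mathrm{MC},n} > \hat{V}^{\pi_1}_{\mathrm{MC},n}\right) = \varphi - 1 \approx 0.618,$$

where $\varphi = \frac{\sqrt{5} + 1}{2}$ is the golden ratio. Moreover, it is possible that for any policy $\pi$, there is another policy $\pi'$ where $\succ_{\mathrm{MC},n}(\pi', \pi) = \text{True}$.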


Proof of Theorem B.1. Consider a multi-armed bandit where there are three actions: $a_1$, which gives a reward of $n + 1$ with probability $p$ and a reward of 0 with probability $1 - p$; $a_2$, which always gives a reward of 1; and $a_3$, which gives a reward of $n^2 + n + 1$ with probability $1 - q$ and a reward of 0.5 with probability $q$. Suppose policies $\pi_1$, $\pi_2$, and $\pi_3$ always choose action $a_1$, $a_2$, and $a_3$ respectively. Now suppose we want to estimate the values of the three policies with $n$ on-policy samples from each. We have that $\pi_1$ gives a higher estimate than $\pi_2$ whenever we get the large reward at least once, which happens with probability $1 - (1 - p)^n$. Thus

$$\Pr\left(\hat{V}^{\pi_1}_{\mathrm{MC},n} > \hat{V}^{\pi_2}_{\mathrm{MC},n}\right) = 1 - (1 - p)^n.$$

Furthermore, $\pi_2$ gives a larger estimate than $\pi_3$ whenever all samples of $\pi_3$ give a reward of 0.5, which happens with probability $q^n$. Finally, $\pi_3$ gives a larger estimate than $\pi_1$ whenever it gives at least one sample with a large reward or when both of them give only samples of their small rewards, which happens with probability $(1 - q^n) + q^n (1 - p)^n$, so

$$\Pr\left(\hat{V}^{\pi_3}_{\mathrm{MC},n} > \hat{V}^{\pi_1}_{\mathrm{MC},n}\right) = (1 - q^n) + q^n (1 - p)^n.$$

Now let $p = 1 - (2 - \varphi)^{1/n}$ and $q = (\varphi - 1)^{1/n}$, where $\varphi = \frac{\sqrt{5} + 1}{2} \approx 1.618$ is the golden ratio. Thus we have that:

$$\Pr\left(\hat{V}^{\pi_1}_{\mathrm{MC},n} > \hat{V}^{\pi_2}_{\mathrm{MC},n}\right) = \varphi - 1$$

$$\Pr\left(\hat{V}^{\pi_2}_{\mathrm{MC},n} > \hat{V}^{\pi_3}_{\mathrm{MC},n}\right) = \varphi - 1$$

$$\Pr\left(\hat{V}^{\pi_3}_{\mathrm{MC},n} > \hat{V}^{\pi_1}_{\mathrm{MC},n}\right) = (2 - \varphi) + (\varphi - 1)(2 - \varphi) = \varphi - 1$$
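The three probabilities can be verified numerically. The sketch below plugs $p = 1 - (2 - \varphi)^{1/n}$ and $q = (\varphi - 1)^{1/n}$ into the closed-form expressions from the proof; the function name is illustrative.

```python
import math


def cycle_probabilities(n: int):
    """Pairwise win probabilities (1 over 2, 2 over 3, 3 over 1) for the bandit."""
    phi = (math.sqrt(5.0) + 1.0) / 2.0
    p = 1.0 - (2.0 - phi) ** (1.0 / n)
    q = (phi - 1.0) ** (1.0 / n)
    p12 = 1.0 - (1.0 - p) ** n                      # a1 pays n + 1 at least once
    p23 = q ** n                                    # a3 only ever pays 0.5
    p31 = (1.0 - q ** n) + q ** n * (1.0 - p) ** n  # a3 pays big, or both pay small
    return p12, p23, p31
```

The last identity relies on $\varphi^2 = \varphi + 1$, since $(2 - \varphi) + (\varphi - 1)(2 - \varphi) = \varphi(2 - \varphi) = 2\varphi - \varphi^2 = \varphi - 1$.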


We now show that for this multi-armed bandit, there is no optimal policy with respect to $\succ_{\mathrm{MC},n}$. A policy in this setting is simply a distribution over $a_1$, $a_2$, and $a_3$. Equivalently, we can view any policy as a mix of the policies $\pi_1$, $\pi_2$, and $\pi_3$. Suppose a policy $\pi$ executes $\pi_1$ with probability $p$, $\pi_2$ with probability $q$, and $\pi_3$ with probability $r$. If $p$ is the largest of the probabilities, then notice that $\Pr\left(\hat{V}^{\pi_3}_{\mathrm{MC},n} > \hat{V}^{\pi}_{\mathrm{MC},n}\right) = p(\varphi - 1) > q(\varphi - 1) = \Pr\left(\hat{V}^{\pi}_{\mathrm{MC},n} > \hat{V}^{\pi_3}_{\mathrm{MC},n}\right)$. We can make a similar argument if $q$ or $r$ is the largest probability. Thus, there is no optimal policy.


C FAIRNESS PROOFS 


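Theorem 6.1. Using the on-policy Monte Carlo estimator for policy selection when we have $n$ samples from each of policies $\pi_1$ and $\pi_2$ is fair provided that

$$V^{\pi_1}_{\max} + V^{\pi_2}_{\max} \le \left|V^{\pi_1} - V^{\pi_2}\right| \sqrt{\frac{2n}{\ln 2}}.$$

We can guarantee strict fairness if the inequality above is strict.

Proof of Theorem 6.1. Suppose without loss of generality that $V^{\pi_1} \ge V^{\pi_2}$. Let $\hat{V}^{\pi}_i = \sum_{t=1}^{T_i} R^{\pi}_{i,t}$ (i.e., the estimate of the value of policy $\pi$ using only $\tau^{\pi}_i$). Now let

$$X_i = \hat{V}^{\pi_1}_i - \hat{V}^{\pi_2}_i.$$

Note that the range of $X_i$ is $[-V^{\pi_2}_{\max}, V^{\pi_1}_{\max}]$. Let $\omega$ be the difference between the upper and lower bounds of $X_i$, that is, $\omega = V^{\pi_1}_{\max} + V^{\pi_2}_{\max}$. Because all $\tau^{\pi_1}_i$ and $\tau^{\pi_2}_i$ are independent of $\tau^{\pi_1}_j$ and $\tau^{\pi_2}_j$ for all $i \ne j$, we know that $X_i$ is independent of $X_j$ for all $i \ne j$. Thus we can use Hoeffding's inequality to find that:

$$\Pr\left(\bar{X} \le 0\right) = \Pr\left(\bar{X} - E[\bar{X}] \le -E[\bar{X}]\right) \le \exp\left(\frac{-2n\,E[\bar{X}]^2}{\omega^2}\right).$$

Note that

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}\sum_{i=1}^{n} \hat{V}^{\pi_1}_i - \frac{1}{n}\sum_{i=1}^{n} \hat{V}^{\pi_2}_i = \hat{V}^{\pi_1}_{\mathrm{MC},n} - \hat{V}^{\pi_2}_{\mathrm{MC},n}$$

and $E[\bar{X}] = V^{\pi_1} - V^{\pi_2}$. Thus, if we want to guarantee

$$\Pr\left(\hat{V}^{\pi_1}_{\mathrm{MC},n} - \hat{V}^{\pi_2}_{\mathrm{MC},n} \le 0\right) \le \delta,$$

we can simply guarantee

$$\exp\left(\frac{-2n\,(V^{\pi_1} - V^{\pi_2})^2}{\omega^2}\right) \le \delta.$$

Solving for $\omega$, we must have that:

$$\omega \le (V^{\pi_1} - V^{\pi_2}) \sqrt{\frac{2n}{\ln(1/\delta)}}.$$

Substituting $\delta = 0.5$, we can thus guarantee that $\Pr\left(\hat{V}^{\pi_1}_{\mathrm{MC},n} - \hat{V}^{\pi_2}_{\mathrm{MC},n} > 0\right) \ge 0.5$, which guarantees fairness when $V^{\pi_1} \ge V^{\pi_2}$. Since we do not actually know which policy has a greater value, we can guarantee fairness with the following condition:

$$\omega \le \left|V^{\pi_1} - V^{\pi_2}\right| \sqrt{\frac{2n}{\ln 2}}.$$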

Theorem 6.2. Using importance sampling for policy selection when we have $n$ samples from the behavior policy is fair with respect to $\succ_V$, provided that

$$w^{\pi_1}_{\max} V^{\pi_1}_{\max} + w^{\pi_2}_{\max} V^{\pi_2}_{\max} \le \left|V^{\pi_1} - V^{\pi_2}\right| \sqrt{\frac{2n}{\ln 2}}.$$

We can guarantee strict fairness if the inequality above is strict.

Proof of Theorem 6.2. Let $\hat{V}^{\pi}_i = w^{\pi}_i \sum_{t=1}^{T_i} R_{i,t}$ (i.e., the estimate of the value of policy $\pi$ using only $\tau_i$). Now let

$$X_i = \hat{V}^{\pi_1}_i - \hat{V}^{\pi_2}_i.$$

Note that the range of $X_i$ is $[-w^{\pi_2}_{\max} V^{\pi_2}_{\max},\; w^{\pi_1}_{\max} V^{\pi_1}_{\max}]$. Let $\omega$ be the difference between the upper and lower bounds of $X_i$, that is, $\omega = w^{\pi_1}_{\max} V^{\pi_1}_{\max} + w^{\pi_2}_{\max} V^{\pi_2}_{\max}$. Because $\tau_i$ and $\tau_j$ are independent for all $i \ne j$, we know that $X_i$ is independent of $X_j$ for all $i \ne j$. The rest of the proof follows exactly as in the proof of Theorem 6.1.


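Theorem 6.3. For any two policies $\pi_1$ and $\pi_2$, behavior policy $\pi_b$, and for all $k \in \{1, 2, 3, \ldots\}$, using importance sampling for policy selection when we have $n$ samples from the behavior policy is fair with respect to $\succ_{\mathrm{MC},kn}$ provided that there exist $\epsilon > 0$ and $\delta < 0.5$ such that $\Pr\left(\left|\hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn}\right| \ge \epsilon\right) \ge 1 - \delta$ and

$$(w^{\pi_1}_{\max} + 1)\,V^{\pi_1}_{\max} + (w^{\pi_2}_{\max} + 1)\,V^{\pi_2}_{\max} \le \epsilon \sqrt{\frac{2n}{\ln\frac{1-\delta}{0.5-\delta}}}.$$

Importance sampling in this setting is transitively fair, provided that $\delta < 0.25$.

Proof of Theorem 6.3. Suppose without loss of generality that $\Pr\left(\hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn} \ge \epsilon\right) \ge 1 - \delta$. Recall that IS uses trajectories $\tau_1, \ldots, \tau_n \sim \pi_b$. Consider additional random samples $\tau^{\pi_1}_1, \ldots, \tau^{\pi_1}_{kn} \sim \pi_1$ and $\tau^{\pi_2}_1, \ldots, \tau^{\pi_2}_{kn} \sim \pi_2$. Note that these samples are all independent from each other. For $i \in \{1, 2, \ldots, n\}$, let $\hat{V}^{\pi}_{\mathrm{MC},i}$ be the estimate of the value of policy $\pi$ using only samples $\tau^{\pi}_i, \tau^{\pi}_{2i}, \ldots, \tau^{\pi}_{ki}$. Furthermore, let $\hat{V}^{\pi}_{\mathrm{IS},i} = w^{\pi}_i \sum_{t=1}^{T_i} R_{i,t}$. Now let

$$X_i = \left(\hat{V}^{\pi_1}_{\mathrm{MC},i} - \hat{V}^{\pi_2}_{\mathrm{MC},i}\right) - \left(\hat{V}^{\pi_1}_{\mathrm{IS},i} - \hat{V}^{\pi_2}_{\mathrm{IS},i}\right).$$

Notice that the range of $X_i$ is $[-V^{\pi_2}_{\max} - w^{\pi_1}_{\max} V^{\pi_1}_{\max},\; V^{\pi_1}_{\max} + w^{\pi_2}_{\max} V^{\pi_2}_{\max}]$. Let $\omega$ be the difference between the upper and lower bounds of $X_i$, that is, $\omega = (w^{\pi_1}_{\max} + 1) V^{\pi_1}_{\max} + (w^{\pi_2}_{\max} + 1) V^{\pi_2}_{\max}$.

Notice that

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \left(\hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn}\right) - \left(\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}}\right)$$

and

$$E[\bar{X}] = (V^{\pi_1} - V^{\pi_2}) - (V^{\pi_1} - V^{\pi_2}) = 0.$$

Thus we have that

$$\Pr\left(\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} < 0\right) = \Pr\left(\bar{X} > \hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn}\right).$$

Thus, we can use Hoeffding's inequality to find that:

$$\Pr\left(\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} < 0 \,\middle|\, \hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn} \ge \epsilon\right) \le \exp\left(\frac{-2n\epsilon^2}{\omega^2}\right).$$

So if we want to guarantee

$$\Pr\left(\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} < 0 \,\middle|\, \hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn} \ge \epsilon\right) \le \gamma,$$

we can simply guarantee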
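$$\exp\left(\frac{-2n\epsilon^2}{\omega^2}\right) \le \gamma.$$

Solving for $\omega$, we must have that:

$$\omega \le \epsilon \sqrt{\frac{2n}{\ln(1/\gamma)}}.$$

Notice that

$$\Pr\left(\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} \ge 0\right) \ge \Pr\left(\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} \ge 0 \,\middle|\, \hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn} \ge \epsilon\right) \times \Pr\left(\hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn} \ge \epsilon\right) \ge (1 - \gamma)(1 - \delta).$$

If we set $\gamma = \frac{0.5 - \delta}{1 - \delta}$, we have that

$$\Pr\left(\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} \ge 0\right) \ge 0.5,$$

which is what we want.

As long as $\delta < 0.25$, IS is transitively fair, since any fair algorithm with respect to $\succ_{\mathrm{MC},kn}$ is transitively fair whenever $\Pr\left(\left|\hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn}\right| > 0\right) > 0.75$.

Algorithm 3 Off-Policy FPS for k policies
input $\Pi$, $\{V^{\pi_i}_{\max}\}$, $p$, $T = \{\tau_1, \tau_2, \ldots, \tau_n \sim \pi_b\}$, FPS
$\delta \leftarrow (1 - p)/(2k - 3)$
$\pi^* \leftarrow \Pi$.next
Eliminated $\leftarrow \emptyset$
CurrBeat $\leftarrow \emptyset$
repeat
    $\pi' \leftarrow (\Pi \setminus \text{CurrBeat})$.next
    winner $\leftarrow$ FPS($\pi^*$, $\pi'$, $V^{\pi^*}_{\max}$, $V^{\pi'}_{\max}$, $\delta$, $T$)
    if winner == $\pi^*$ then
        Eliminated $\leftarrow$ Eliminated $\cup\ \{\pi'\}$
        CurrBeat $\leftarrow$ CurrBeat $\cup\ \{\pi'\}$
    else if winner == $\pi'$ then
        Eliminated $\leftarrow$ Eliminated $\cup\ \{\pi^*\}$
        CurrBeat $\leftarrow$ CurrBeat $\cup\ \{\pi^*\}$
    else
        $\pi^* \leftarrow (\Pi \setminus \text{Eliminated})$.next
        Eliminated $\leftarrow$ Eliminated $\cup\ \{\pi^*, \pi'\}$
        CurrBeat $\leftarrow \emptyset$
    end if
until len(Eliminated) == $k - 1$ or len(CurrBeat) == $k$
if len(Eliminated) == $k - 1$ then
    return $\pi^*$
else
    return No Fair Comparison
end if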


D MULTIPLE COMPARISONS 


Here we present an algorithm that uses pairwise comparisons to select amongst $k \ge 2$ policies (Algorithm 3). This algorithm can take as input either of the off-policy fair policy selection algorithms above (or some variant thereof).

Theorem D.1. For any finite set of $k$ policies $\Pi$, behavior policy $\pi_b$, $p = 0.5$, and fair off-policy policy selection algorithm FPS, Algorithm 3 is a strictly fair policy selection algorithm when we have $n$ samples drawn from $\pi_b$.


Proof of Theorem D.1. The algorithm essentially applies an algorithm for finding the maximum element of a set, with the exception that whenever it cannot make a fair comparison between two policies, it will eliminate both of those policies from consideration of being better than all other policies with respect to the better-than function. The algorithm must return No Fair Comparison if and only if every policy is eliminated. Notice that we only eliminate a policy when it is not returned by FPS or when No Fair Comparison is returned, which is correct. Notice that until there are $k - 1$ policies that are eliminated, at every comparison at least one policy is eliminated, and the last of those comparisons must include the only remaining non-eliminated policy. Afterwards it takes at most $k - 2$ comparisons with the final policy (comparing it to every other policy other than the one it was already compared to) to determine that no fair comparison is possible, making a total of $2k - 3$ comparisons.

On the other hand, the algorithm must return a policy when that policy is outputted by FPS from comparisons with every other policy, which is exactly what it does (i.e., when the CurrBeat set includes $k - 1$ policies). The maximum number of comparisons it takes to output a policy is the number of comparisons it takes to eliminate $k - 1$ other policies plus the number of comparisons it takes to beat $k - 2$ policies using the same argument as above, making a total of $2k - 3$ comparisons. Thus, if we let $\delta = (1 - p)/(2k - 3)$, all of the comparisons made by FPS will simultaneously hold with probability $1 - (2k - 3)\delta = p$, and so setting $p = 0.5$ ensures fairness.


E SAFETY THEOREMS AND PROOFS 


In this section, we will prove the theorems for ensuring 
safety when using importance sampling for policy selec- 
tion. 


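Theorem 6.4. For any two policies $\pi_1$ and $\pi_2$, behavior policy $\pi_b$, $\omega = w^{\pi_1}_{\max} V^{\pi_1}_{\max} + w^{\pi_2}_{\max} V^{\pi_2}_{\max}$, and $\delta < 0.5$, Algorithm 2 is a safe policy selection algorithm with respect to $\succ_V$ with probability $1 - \delta$.

Proof of Theorem 6.4. Let $\hat{V}^{\pi}_i = w^{\pi}_i \sum_{t=1}^{T_i} R_{i,t}$ (i.e., the estimate of the value of policy $\pi$ using only $\tau_i$). Now let

$$X_i = \hat{V}^{\pi_1}_i - \hat{V}^{\pi_2}_i.$$

Note that the range of $X_i$ is $[-w^{\pi_2}_{\max} V^{\pi_2}_{\max},\; w^{\pi_1}_{\max} V^{\pi_1}_{\max}]$. Because $\tau_i$ and $\tau_j$ are independent for all $i \ne j$, we know that $X_i$ is independent of $X_j$ for all $i \ne j$. Thus we can use Hoeffding's inequality to find that:

$$\Pr\left(\bar{X} - E[\bar{X}] \ge \omega\sqrt{\frac{\ln(1/\gamma)}{2n}}\right) \le \gamma$$

and

$$\Pr\left(\bar{X} - E[\bar{X}] \le -\omega\sqrt{\frac{\ln(1/\gamma)}{2n}}\right) \le \gamma,$$

where $\omega = w^{\pi_1}_{\max} V^{\pi_1}_{\max} + w^{\pi_2}_{\max} V^{\pi_2}_{\max}$. Note that

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}\sum_{i=1}^{n} \hat{V}^{\pi_1}_i - \frac{1}{n}\sum_{i=1}^{n} \hat{V}^{\pi_2}_i = \hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}}$$

and $E[\bar{X}] = V^{\pi_1} - V^{\pi_2}$. Thus, substituting $(1 - p)/2$ for $\gamma$, we have that the following two statements hold with probability at least $1 - 2\gamma = p$:

$$V^{\pi_1} - V^{\pi_2} \ge \hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} - \omega\sqrt{\frac{\ln(2/(1-p))}{2n}}$$

and

$$V^{\pi_1} - V^{\pi_2} \le \hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} + \omega\sqrt{\frac{\ln(2/(1-p))}{2n}}.$$

Thus the probability that $V^{\pi_1} - V^{\pi_2} < 0$ but $\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} - \omega\sqrt{\frac{\ln(2/(1-p))}{2n}} > 0$, or $V^{\pi_1} - V^{\pi_2} > 0$ but $\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} + \omega\sqrt{\frac{\ln(2/(1-p))}{2n}} < 0$, is less than $1 - p$, which means for $p = 1 - \delta$, we will output the worse policy according to $\succ_V$ with probability at most $\delta$, which is exactly what we need.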

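Theorem 6.5. For any two policies $\pi_1$ and $\pi_2$, where $\Pr\left(\left|\hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn}\right| > 0\right) \ge 1 - \delta_{\mathrm{MC}}$, any behavior policy $\pi_b$, $\omega = (w^{\pi_1}_{\max} + 1) V^{\pi_1}_{\max} + (w^{\pi_2}_{\max} + 1) V^{\pi_2}_{\max}$, $p = 1 - \delta_{\mathrm{MC}}\delta$ for some $\delta < 0.5$, and for all $k \in \{1, 2, 3, \ldots\}$, Algorithm 2 is a safe policy selection algorithm with respect to $\succ_{\mathrm{MC},kn}$ with probability $1 - \delta$ when we have $n$ samples drawn from $\pi_b$. It is a transitively fair policy selection algorithm whenever $\delta_{\mathrm{MC}} < 0.25$.

Proof of Theorem 6.5. Recall that Algorithm 2 receives as input $\tau_1, \ldots, \tau_n \sim \pi_b$. Consider additional random samples $\tau^{\pi_1}_1, \ldots, \tau^{\pi_1}_{kn} \sim \pi_1$ and $\tau^{\pi_2}_1, \ldots, \tau^{\pi_2}_{kn} \sim \pi_2$. Note that these samples are all independent from each other. For $i \in \{1, 2, \ldots, n\}$, let $\hat{V}^{\pi}_{\mathrm{MC},i}$ be the estimate of the value of policy $\pi$ using only samples $\tau^{\pi}_i, \tau^{\pi}_{2i}, \ldots, \tau^{\pi}_{ki}$. Furthermore, let $\hat{V}^{\pi}_{\mathrm{IS},i} = w^{\pi}_i \sum_{t=1}^{T_i} R_{i,t}$. Now let

$$X_i = \left(\hat{V}^{\pi_1}_{\mathrm{MC},i} - \hat{V}^{\pi_2}_{\mathrm{MC},i}\right) - \left(\hat{V}^{\pi_1}_{\mathrm{IS},i} - \hat{V}^{\pi_2}_{\mathrm{IS},i}\right).$$

Notice that the range of $X_i$ is $[-V^{\pi_2}_{\max} - w^{\pi_1}_{\max} V^{\pi_1}_{\max},\; V^{\pi_1}_{\max} + w^{\pi_2}_{\max} V^{\pi_2}_{\max}]$. Thus we can use Hoeffding's inequality to find that:

$$\Pr\left(\bar{X} - E[\bar{X}] \ge \omega\sqrt{\frac{\ln(1/\gamma)}{2n}}\right) \le \gamma$$

and

$$\Pr\left(\bar{X} - E[\bar{X}] \le -\omega\sqrt{\frac{\ln(1/\gamma)}{2n}}\right) \le \gamma,$$

where $\omega = (w^{\pi_1}_{\max} + 1) V^{\pi_1}_{\max} + (w^{\pi_2}_{\max} + 1) V^{\pi_2}_{\max}$.

Notice that

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \left(\hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn}\right) - \left(\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}}\right)$$

and

$$E[\bar{X}] = (V^{\pi_1} - V^{\pi_2}) - (V^{\pi_1} - V^{\pi_2}) = 0.$$

Thus, substituting $(1 - p)/2$ for $\gamma$, we have that the following two statements hold with probability at least $1 - 2\gamma = p$:

$$\hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn} \ge \hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} - \omega\sqrt{\frac{\ln(2/(1-p))}{2n}}$$

and

$$\hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn} \le \hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} + \omega\sqrt{\frac{\ln(2/(1-p))}{2n}}.$$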

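Thus the probability that $\hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn} < 0$ but $\left(\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}}\right) - \omega\sqrt{\frac{\ln(2/(1-p))}{2n}} > 0$, or $\hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn} > 0$ but $\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} + \omega\sqrt{\frac{\ln(2/(1-p))}{2n}} < 0$, is less than $1 - p$. Now suppose without loss of generality that $\Pr\left(\hat{V}^{\pi_1}_{\mathrm{MC},kn} > \hat{V}^{\pi_2}_{\mathrm{MC},kn}\right) = \delta_{\mathrm{MC}} > 0.5$. The probability that we output $\pi_2$ is at most $(1 - p)/\delta_{\mathrm{MC}}$. So if $p = 1 - \delta_{\mathrm{MC}}\delta$, we output the worse policy with respect to $\succ_{\mathrm{MC},kn}$ with probability at most $\delta$, which is exactly what we need.

As long as $\delta_{\mathrm{MC}} < 0.25$, the algorithm is transitively fair, since any fair algorithm with respect to $\succ_{\mathrm{MC},kn}$ is transitively fair whenever $\Pr\left(\left|\hat{V}^{\pi_1}_{\mathrm{MC},kn} - \hat{V}^{\pi_2}_{\mathrm{MC},kn}\right| > 0\right) > 0.75$.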


Theorem 6.6. There exist policies $\pi_1$, $\pi_2$, and behavior policy $\pi_b$ for which Algorithm 2 with inputs as described in Theorem 6.4 is not a safe policy selection algorithm with respect to $\succ_{\mathrm{MC},1}$ with $p = 0.5$ when we have a single sample drawn from $\pi_b$.

Proof of Theorem 6.6. Consider a world where there are three trajectories: $\tau_1$ with reward 0.0001, $\tau_2$ with reward 0.0002, and $\tau_3$ with reward 1. We want to select between two policies: $\pi_1$, which places probability 1 on $\tau_2$, and $\pi_2$, which places probability 0.51 on $\tau_1$ and probability 0.49 on $\tau_3$. When we only have one sample from each policy, $\Pr\left(\hat{V}^{\pi_1}_{\mathrm{MC},1} > \hat{V}^{\pi_2}_{\mathrm{MC},1}\right) = 0.51 > 0.5$, but clearly $V^{\pi_1} < V^{\pi_2}$. Now consider using IS with behavior policy $\pi_b$, which places probability 0.48 on $\tau_1$, probability 0.01 on $\tau_2$, and probability 0.51 on $\tau_3$. If we apply Algorithm 2 with the inputs that guarantee that the algorithm is safe with respect to $\succ_V$ (as given in Theorem 6.4), we find that whenever $\pi_b$ samples $\tau_3$,

$$\hat{V}^{\pi_1}_{\mathrm{IS}} - \hat{V}^{\pi_2}_{\mathrm{IS}} + \left(w^{\pi_1}_{\max} V^{\pi_1}_{\max} + w^{\pi_2}_{\max} V^{\pi_2}_{\max}\right)\sqrt{\frac{\ln 4}{2}} = 0(1) - \frac{0.49}{0.51}(1) + \left(\frac{1}{0.01}(0.0002) + \frac{0.51}{0.48}(1)\right)\sqrt{\frac{\ln 4}{2}} = -0.060 < 0.$$

Since this event occurs with probability 0.51, we find that Algorithm 2 returns $\pi_2$ more than half the time, indicating that Algorithm 2 is not a safe policy selection algorithm with respect to $\succ_{\mathrm{MC},1}$.
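The arithmetic in this counterexample can be reproduced directly. The sketch below encodes the three trajectories and the three policies as dictionaries and evaluates the upper end of the interval that Algorithm 2 checks, with $n = 1$ and $p = 0.5$. It assumes $w^{\pi}_{\max}$ and $V^{\pi}_{\max}$ are taken over the trajectories in each policy's support, which matches the numbers in the displayed equation; all variable names are illustrative.

```python
import math

rewards = {"t1": 0.0001, "t2": 0.0002, "t3": 1.0}
pi_1 = {"t1": 0.0, "t2": 1.0, "t3": 0.0}
pi_2 = {"t1": 0.51, "t2": 0.0, "t3": 0.49}
pi_b = {"t1": 0.48, "t2": 0.01, "t3": 0.51}


def omega_term(pi):
    """w_max * V_max for one policy, over the trajectories it can generate."""
    support = [t for t in pi if pi[t] > 0]
    w_max = max(pi[t] / pi_b[t] for t in support)
    v_max = max(rewards[t] for t in support)
    return w_max * v_max


omega = omega_term(pi_1) + omega_term(pi_2)
beta = omega * math.sqrt(math.log(2.0 / (1.0 - 0.5)) / (2.0 * 1))  # n = 1, p = 0.5


def upper_bound(t):
    """IS estimate of V(pi_1) - V(pi_2) from the single sample t, plus beta."""
    w1 = pi_1[t] / pi_b[t]
    w2 = pi_2[t] / pi_b[t]
    return (w1 - w2) * rewards[t] + beta


bound_on_t3 = upper_bound("t3")
```

When the single sample is $\tau_3$, the upper bound is negative, so Algorithm 2 returns $\pi_2$; for the other two trajectories the bound stays positive.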


